whisper : map token timestamps to original time when VAD is enabled by buxuku · Pull Request #3910 · ggml-org/whisper.cpp

buxuku · 2026-06-25T10:32:35Z

When VAD is enabled, the segment getters (whisper_full_get_segment_t0/t1) already map timestamps back to the original audio timeline, but the per-token timestamps in whisper_full_get_token_data() stay in the VAD-processed timeline with the silence removed. So if you build word-level timing on top of the token times while VAD is on, the words drift off by however much silence VAD stripped out, and there's no public getter that applies the mapping.

This adds whisper_full_get_token_t0/t1 (plus the _from_state variants) that map the token times back. A token inside a speech segment is interpolated within that segment; a token that falls in the silence removed between two segments is snapped to the nearest boundary, so it doesn't end up in the middle of a gap that isn't in the original audio. With VAD off, or when there's no segment info, the stored token times are returned unchanged, so existing callers aren't affected.
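For illustration only, here is a minimal self-contained sketch of the mapping rule described above. It is not the code from this PR: the `vad_segment` struct, the `map_token_time` function, and the layout of the mapping table are hypothetical names and assumptions. Times are assumed to be in whisper.cpp's usual 10 ms units.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one speech segment kept by VAD (not a whisper.cpp type).
// All times are assumed to be in 10 ms units.
struct vad_segment {
    int64_t orig_start; // segment start in the original audio timeline
    int64_t orig_end;   // segment end in the original audio timeline
    int64_t vad_start;  // segment start in the VAD-processed timeline
    int64_t vad_end;    // segment end in the VAD-processed timeline
};

// Map a token time from the VAD-processed timeline back to the original one.
// Assumes `segs` is sorted by time and the segments do not overlap.
int64_t map_token_time(const std::vector<vad_segment> & segs, int64_t t) {
    // No segment info (e.g. VAD off): return the stored time unchanged.
    if (segs.empty()) {
        return t;
    }
    // Clamp times that fall before the first or after the last speech segment.
    if (t <= segs.front().vad_start) {
        return segs.front().orig_start;
    }
    if (t >= segs.back().vad_end) {
        return segs.back().orig_end;
    }
    for (size_t i = 0; i < segs.size(); ++i) {
        const vad_segment & s = segs[i];
        if (t < s.vad_start) {
            // t lies in the gap between segment i-1 and segment i:
            // snap to whichever speech boundary is closer.
            const vad_segment & prev = segs[i - 1];
            const int64_t d_prev = t - prev.vad_end;
            const int64_t d_next = s.vad_start - t;
            return d_prev <= d_next ? prev.orig_end : s.orig_start;
        }
        if (t <= s.vad_end) {
            // t lies inside segment i: interpolate linearly within it.
            const int64_t vad_len = s.vad_end - s.vad_start;
            if (vad_len <= 0) {
                return s.orig_start;
            }
            return s.orig_start + (t - s.vad_start) * (s.orig_end - s.orig_start) / vad_len;
        }
    }
    return segs.back().orig_end; // not reached given the clamps above
}
```

With two speech segments at 1.00-3.00 s and 8.00-10.00 s of the original audio, stored as 0-200 and 210-410 in the processed timeline, a token at processed time 100 maps to 200 (2.00 s), while a token at 203, which sits in the gap, snaps to 300, the end of the first segment, rather than landing somewhere in the removed silence.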

I hit this doing word-level re-segmentation with VAD enabled: the segment times lined up with the original audio but the token times didn't. Also extended tests/test-vad-full.cpp to exercise the new getters. Built and ran it on macOS.

danbev

Optional, but perhaps we could extend test-vad-full.cpp to exercise these new functions.

whisper_full_get_token_data().t0/t1 are in the VAD-processed timeline (silences removed), so they don't line up with the original audio. Add whisper_full_get_token_t0/t1 that map them back. A token inside a speech segment is interpolated within it; a token that falls in a removed inter-segment silence snaps to the nearest boundary, so it never lands in the middle of a cut-out gap. Without VAD the raw times are returned unchanged.

buxuku · 2026-06-27T13:36:10Z

@danbev pushed an update on top of the approved version: the token times are now mapped segment by segment instead of one interpolation over the whole mapping table, and a token that lands in a removed silence snaps to the nearest speech boundary rather than somewhere in the middle of a gap that isn't in the original audio. Also extended test-vad-full.cpp to cover the new getters as you suggested. PTAL when you have a moment, thanks!

danbev approved these changes Jun 25, 2026


buxuku force-pushed the vad-token-timestamps branch from c008fa5 to e14e08b Compare June 27, 2026 13:35


buxuku wants to merge 1 commit into ggml-org:master from buxuku:vad-token-timestamps
